Efficient Parallel Skyline Query Processing for High-Dimensional Data

Authors

  • Mingjie Tang
  • Yongyang Yu
  • Walid G. Aref
  • Qutaibah M. Malluhi
  • Mourad Ouzzani
Abstract

Given a set of multidimensional data points, a skyline query retrieves the points that are not dominated by any other point in the set. Because skyline queries are used widely, for example in preference-based query answering and decision making, and must often handle large amounts of data, processing them scalably is of critical importance. However, several outstanding challenges have not been well addressed. Specifically, in this paper we tackle the data straggler and data skew problems introduced by distributed skyline query processing, as well as the ensuing high computing cost of merging skyline candidates. We introduce a new, efficient three-phase approach for large-scale processing of skyline queries. In the first (preprocessing) phase, the data is partitioned along the Z-order curve, using a novel approach that formulates data partitioning as an optimization problem to minimize the size of intermediate data. In the second phase, each computation node partitions its input data points into separate sets and then computes the skyline of each set in parallel to produce skyline candidates. In the final phase, we build an index and employ an efficient algorithm to merge the generated skyline candidates. Extensive experiments demonstrate that the proposed skyline algorithm improves performance by more than an order of magnitude compared with existing state-of-the-art approaches.
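The three phases described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: it assumes that smaller values are preferred in every dimension, that coordinates are non-negative integers, and it replaces the paper's optimization-based partitioning and index-based merge with simple equal-sized chunks along the Z-order and a naive block-nested-loop merge. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def z_value(p: Point, bits: int = 16) -> int:
    """Morton (Z-order) code: interleave the low `bits` bits of each coordinate."""
    d = len(p)
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(p):
            z |= ((coord >> bit) & 1) << (bit * d + dim)
    return z


def partition_by_z_order(points: Sequence[Point], num_partitions: int) -> List[List[Point]]:
    """Phase 1 (simplified): sort along the Z-order curve and cut into
    roughly equal chunks. The paper instead chooses the cuts by solving
    an optimization problem; that step is not reproduced here."""
    ordered = sorted(points, key=z_value)
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def local_skyline(points: Sequence[Point]) -> List[Point]:
    """Phase 2: block-nested-loop skyline of one partition."""
    result: List[Point] = []
    for p in points:
        if any(dominates(s, p) for s in result):
            continue  # p is dominated, so it cannot be a skyline point
        # p survives; drop any earlier candidates that p dominates
        result = [s for s in result if not dominates(p, s)]
        result.append(p)
    return result


def skyline(points: Sequence[Point], num_partitions: int = 4) -> List[Point]:
    """Phase 3 (simplified): merge candidates by running the same skyline
    routine over their union, rather than using an index."""
    candidates = [p for part in partition_by_z_order(points, num_partitions)
                  for p in local_skyline(part)]
    return local_skyline(candidates)
```

The merge is correct because any point dominated within its own partition is also dominated in the full set, so the global skyline is always a subset of the union of local skylines. In a distributed setting each call to `local_skyline` would run on a separate node; the cost of the final merge is what the paper's index is meant to reduce.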

Similar resources

Geometry-Based Distributed Spatial Skyline Queries in Wireless Sensor Networks

Algorithms for skyline querying based on wireless sensor networks (WSNs) have been widely used in the field of environmental monitoring. Because of the multi-dimensional nature of the problem of monitoring spatial position, traditional skyline query strategies cause enormous computational costs and energy consumption. To ensure the efficient use of sensor energy, a geometry-based distributed sp...

Finding Skylines for Incomplete Data

In the last decade, skyline queries have been extensively studied for different domains because of their wide applications in multi-criteria decision making and search space pruning. A skyline query returns all the interesting points in a multi-dimensional data set that are not dominated by any other point with respect to all dimensions. However, real world data sets are seldom complete, i.e. d...

Approaching the Efficient Frontier: Cooperative Database Retrieval Using High-Dimensional Skylines

Cooperative database retrieval is a challenging problem: top k retrieval delivers manageable results only when a suitable compensation function (e.g. a weighted mean) is explicitly given. On the other hand skyline queries offer intuitive querying to users, but result set sizes grow exponentially and hence can easily exceed manageable levels. We show how to combine the advantages of skyline quer...

Simultaneous Processing of Multi-Skyline Queries with MapReduce

With the rapid increase in the number of applications as well as the sizes of data, multi-query processing on the MapReduce framework has gained much attention. Meanwhile, there has been much interest in skyline query processing due to its power in multi-criteria decision making and analysis. Recently, there have been attempts to optimize multi-query processing in MapReduce. However, they are not ...

A fast and progressive algorithm for skyline queries with totally- and partially-ordered domains

We devise a skyline algorithm that can efficiently mitigate the enormous overhead of processing millions of tuples on totally- and partially-ordered domains (henceforth, TODs and PODs). With massive datasets, existing techniques spend a significant amount of time on a dominance comparison because of both a large number of skyline points and the unprogressive method of skyline computing with PODs....

Publication date: 2016